Train Set

This notebook takes a "first look" at the train set of the Kaggle competition "Google Landmark Recognition 2020".

We would like to look at some of the images from the train set:

As this small sample shows, the images in the train set differ in size and color. They also depict very different landmarks, and some of them show a landmark's interior rather than the landmark itself.

Some of the train set properties:

As we can see, the data set is composed of 1,580,470 pictures divided into 81,313 classes. This number of objects and classes makes the data set a really challenging one.
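A minimal sketch of how such counts could be computed. It assumes the labels come as a CSV with one row per image and the columns `id` and `landmark_id` (an assumption about the file layout). The toy data and the helper name `summarize` are illustrative only and are not the notebook's original code.

```python
import csv
import io
from collections import Counter

# Toy stand-in for the label file; the column names `id` and
# `landmark_id` are an assumption about the competition CSV.
TOY_CSV = """id,landmark_id
a1,7
a2,7
a3,7
b1,12
b2,12
c1,30
"""

def summarize(csv_file):
    """Return (number of images, number of classes) for a label file."""
    reader = csv.DictReader(csv_file)
    # landmark_id -> number of images carrying that label
    class_sizes = Counter(row["landmark_id"] for row in reader)
    return sum(class_sizes.values()), len(class_sizes)

n_images, n_classes = summarize(io.StringIO(TOY_CSV))
print(f"{n_images} images in {n_classes} classes")
```

To run it on a real file, an open file handle would take the place of the `io.StringIO` wrapper.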

The train set histogram:

And zoomed-in:

As the histogram of the train set shows, there is huge variation in the number of objects per class. The train set distribution is therefore long-tailed.
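The data behind such a histogram can be sketched without any plotting library by counting how many classes share each class size. The labels below are hypothetical toy values and `class_size_histogram` is an illustrative helper, not the notebook's original code.

```python
from collections import Counter

def class_size_histogram(labels):
    """Map each class size to the number of classes that have that size."""
    class_sizes = Counter(labels)         # landmark_id -> object count
    return Counter(class_sizes.values())  # object count -> number of classes

# Toy labels (hypothetical): one large class and several tiny ones.
toy_labels = [1] * 8 + [2] * 3 + [3] * 2 + [4] * 2 + [5] * 2
hist = class_size_histogram(toy_labels)
for size in sorted(hist):
    print(f"{hist[size]} class(es) with {size} objects")
```

In a long-tailed data set, most of the mass of this mapping sits at the smallest sizes, while a few entries sit at very large sizes.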

We would like to look more carefully at the top and bottom classes in the data set.

We'll start with the top classes:

We will look at the number of objects in each of the top 10 classes:

As the table and the graph show, class number 138982 is the largest class by a wide margin. It contains 6,272 objects, whereas none of the other top classes contains more than 2,500.
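A ranking like this can be sketched with `collections.Counter.most_common`. The class ids and counts below are made-up toy values, and `top_classes` is a hypothetical helper name.

```python
from collections import Counter

def top_classes(labels, k):
    """Return the k largest classes as (landmark_id, object_count) pairs."""
    return Counter(labels).most_common(k)

# Toy labels (hypothetical ids and sizes, not the competition data).
toy_labels = [10] * 6 + [20] * 3 + [30] * 2 + [40] * 2
largest = top_classes(toy_labels, 2)

(top_id, top_count), (_, runner_up_count) = largest
ratio = top_count / runner_up_count
print(f"class {top_id}: {top_count} objects, {ratio:.1f}x the runner-up")
```

Comparing the largest class with the runner-up, as in the last lines, is one simple way to quantify the margin.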

We will now look at 12 images from the top 5 classes:

As the images above show, the differences between images of the same class are large, and in some cases it is not clear why such different images belong to the same class. This large intra-class variability is another major challenge of the data set.

We'll now look at the top 50 classes:

This graph illustrates one of the major challenges of the given train set: the long-tailed distribution. The tail is long even within only the top 50 classes (out of more than 80K).

We'll now look at the bottom classes. We'll start with the bottom 10:

As we can see, all of the bottom 10 classes have only 2 objects each.

The bottom 50 classes, like the bottom 10, have only 2 objects each.
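The smallest classes can be listed in the same spirit by sorting class sizes in ascending order. This is a sketch on hypothetical toy labels, and `bottom_classes` is an illustrative helper name.

```python
from collections import Counter

def bottom_classes(labels, k):
    """Return the k smallest classes as (landmark_id, object_count) pairs."""
    class_sizes = Counter(labels)
    # sorted() is stable, so classes of equal size keep first-seen order.
    return sorted(class_sizes.items(), key=lambda item: item[1])[:k]

# Toy labels (hypothetical ids and sizes, not the competition data).
toy_labels = [10] * 6 + [20] * 3 + [30] * 2 + [40] * 2
smallest = bottom_classes(toy_labels, 2)

# Check whether every listed class has the same size.
same_size = len({count for _, count in smallest}) == 1
print(smallest, "all the same size:", same_size)
```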

In fact, more than half of the classes have 9 objects or fewer:

This is another aspect of the long-tailed distribution: many classes contain only a small number of objects, which will make the training process more difficult.
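One way to check a statement such as "more than half of the classes have 9 objects or fewer" is to compute the share of classes at or below a size threshold, together with the median class size. The labels below are hypothetical toy values and `small_class_share` is an illustrative helper, not the notebook's original code.

```python
from collections import Counter
from statistics import median

def small_class_share(labels, max_size):
    """Fraction of classes that contain at most max_size objects."""
    class_sizes = Counter(labels).values()
    small = sum(1 for size in class_sizes if size <= max_size)
    return small / len(class_sizes)

# Toy labels (hypothetical): class sizes 20, 12, 9, 4 and 2.
toy_labels = [10] * 20 + [20] * 12 + [30] * 9 + [40] * 4 + [50] * 2
share = small_class_share(toy_labels, 9)
median_size = median(Counter(toy_labels).values())
print(f"{share:.0%} of classes have 9 objects or fewer; median size {median_size}")
```

A median class size at or below the threshold is equivalent to at least half of the classes being that small, so the two numbers cross-check each other.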

This is the train set we will use to train our algorithm.